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Abstract 

We  describe  a  parallel  version  of  the  shortest  augmenting  path  algorithm 
for  the  assignment  problem.  While  generating  the  initial  dual  solution  and 
partial  assignment  in  parallel  does  not  require  substantive  changes  in  the 
sequential  algorithm,  using  several  augmenting  paths  in  parallel  does  require 
a  new  dual  variable  recalculation  method.  The  parallel  algorithm  was  tested 
on  a  14-processor  Butterfly  Plus  computer,  on  problems  with  up  to  9C0  million 
variables.  The  speedup  obtained  increases  with  problem  size.  The  algorithm 
was  also  embedded  into  a  parallel  branch  and  bound  procedure  for  the  traveling 
salesman  problem  on  a  directed  graph,  which  was  tested  on  the  Butterfly  Plus 
on  problems  involving  up  to  7,500  cities.  To  our  knowledge,  these  are  the 

largest  assignment  problems  and  traveling  salesman  problems  solved  so  far. 
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1.  Introduction 


The  assignment  problem  can  be  defined  either  on  a  directed  graph,  in 
which  case  an  assignment  (a  solution)  is  a  spanning  union  of  directed  cycles, 
or  on  an  undirected  bipartite  graph,  in  which  case  an  assignment  is  a  perfect 
matching.  We  work  with  this  latter  formulation.  Given  a  bipartite  graph 
G  =  (SuT,  A)  with  arc  costs  c  ,  (i,j)cA,  where  |S|  =  |T|  =  n,  the  assignment 
problem  (AP)  asks  for  a  pairing  (matching,  assignment)  of  the  nodes  in  S  to 
those  in  T  that  minimizes  the  sum  of  costs  of  arcs  in  the  pairing.  It  can  be 
stated  as 

(1)  min  I(c  x  : ieS,  jeT) 
subject  to 


«v 

jeT)  =  1 

ieS 

z(v 

ieS)  =  1 

JeT 

(3)  x  e{0,l},  ( i , j )eA 

where  x  =  1  if  and  only  if  node  ieS  is  paired  with  node  jeT,  i.e.  arc  (i,j) 

^  J  • 

is  in  the  matching.  Condition  (3)  can  be  replaced  by 

(4)  x  £  0,  ( i, j)eA 

because  the  resulting  linear  program  has  only  integer  basic  solutions  due  to 
the  total  unimodularity  of  the  coefficient  matrix  of  the  system  (2). 

The  linear  program  dual  to  (AP)  can  be  stated  as 

(5)  max  Ku^ieS)  +  Z(v^:jeT) 
subject  to 

(6)  u  +  v  s  c  ,  (i, j)eA. 

t  J  iJ  J 

It  is  well  known  that  a  vector  x  satisfying  (2),  (3)  is  an  optimal 

assignment  if  and  only  if  there  exists  a  vector  (u,v)  satisfying  (6)  and  such 
that 
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(7) 

The  numbers  c 


c 


U 


f  =  0  if  x  =1 

i  j 

-  u  -  V  - 

J  £  0  otherwise. 

v. 

c  -  u  -  v  are  called  reduced  costs. 


ij  iJ  i  J 

Both  the  primal  and  dual  simplex  methods  have  their  specialized  versions 
for  the  assignment  problem  [3,2,21].  The  most  popular  early  approach  known  as 
the  Hungarian  method  [18,12,19,8,20],  can  be  viewed  as  primal-dual  in  nature, 
although  in  fact  it  never  uses  or  produces  a  primal  basis.  The  same  can  be 
said  about  the  Shortest  Augmenting  Path  method  [6,9,11,15,28],  which  treats 
(AP)  as  a  specialized  minimum  cost  network  flow  problem.  Our  reason  for 
choosing  this  latter  approach  for  parallelization  is  that  (a)  we  consider  it 
to  be  the  most  promising  and  (b)  it  contains  parallelisms  of  somewhat  higher 
order  of  granularity  than  the  other  approaches.  A  different  approach  that 
easily  lends  itself  to  parallelization,  is  based  on  relaxation  techniques  [4], 
We  note  that  the  rows  and  columns  of  the  cost  matrix  c  correspond  to  the 
nodes  in  S  and  in  T,  respectively.  Throughout  the  paper,  we  will  refer  to 
elements  of  S  (of  T)  either  as  nodes,  or  as  rows  (as  columns). 


2.  The  Sequential  Algorithm 

The  procedure  starts  by  finding  an  initial  solution  (Phase  0),  i.e.  a 
partial  assignment  that  is  optimal  in  the  subgraph  induced  by  the  assigned 
nodes.  This  is  achieved  by  constructing  a  basic  feasible  solution  to  the  dual 
of  (AP)  and  then  finding  a  maximum  cardinality  matching  in  the  subgraph  having 
only  those  arcs  with  zero  reduced  cost.  We  do  this  in  O(n-z)  time,  where  z  is 
th*»  number  of  arcs  with  zero  reduced  costs,  and  it  typically  yields  a 
considerably  larger  number  of  initial  pairings  than  the  usual  greedy 
heuristics. 

Next  the  procedure  enters  an  alternating  sequence  of  two  phases: 
augmenting  path  finding  (Phase  1)  and  updating  (recalculation)  of  the  dual 
variables  and  of  the  (primal)  assignment  (Phase  2). 
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In  Phase  1,  an  unassigned  node  ieS  is  selected  and  a  shortest  (with 

respect  to  the  current  reduced  costs)  augmenting  path  is  found  from  it  to  some 

unassigr.ed  node  jd.  This  is  done  by  growing  an  alternating  tree  rooted  at 

2 

node  i  by  a  slightly  modified  version  of  Dijkstra’s  0(n  )  labeling  procedure 
[10].  The  modification,  necessitated  by  the  fact  that  the  shortest  path  to  be 
found  has  to  be  an  alternating  one,  consists  in  restricting  Dijkstra’s 
selection  rule  to  successors  of  nodes  in  S,  while  the  successors  of  nodes  in  T 
are  uniquely  determined  and  thus  leave  no  choice. 

In  Phase  2  the  labels  generated  in  Phase  1  are  used  to  calculate  new 
values  for  the  dual  variables,  and  the  augmenting  path  is  used  to  generate  a 
new  assignment,  namely  the  symmetric  difference  between  the  old  assignment  and 
the  augmenting  path.  The  new  assignment  matches  one  more  pair  of  nodes  than 
the  previous  one.  The  algorithm  stops  when  all  the  nodes  have  been  paired. 

A  more  precise  statement  of  the  algorithm  follows. 

begin 

*  *  *  Phase  0  *  •  • 
begin 


u(  :=  min{ci ^ :  jcT> ,  icS 

v  :=  min{c  -  u  :ieS),  jcT 
_J  U  i 

A  :=  {(i,j)sA:c  -  u  -  v  =  0 > 
U  i  J 

find  a  maximum  matching  A  in  G 


end 

while  | A  |  <  n  do 
•  •  •  Phase  1  *  *  • 
begin 


(SuT, A) 


choose  an  unassigned  row  i  eS 

*  *  initialize  the  set  of  unlabeled  (UC)  and  labeled  (LC)  columns  *  *  * 

UC  :*  r,  LC  :=  a 

*  *  initialize  the  labels  A^  and  the  predecessors  p^  *  *  * 

for  each  jcT  do  A  :=  c  -u  -v,p  :=  0 
J  ij  J  *J 

*  •  find  shortest  augmenting  path  •  •  * 

repeat 

find  jeUC  with  A  =  min{A  : keUC} 

J  k 
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UC  :=  UCMJ},  LC  :=  LCu{j} 

if  J  is  assigned  then 
begin 

i  :=  row  assigned  to  column  j 
for  each  keUC  do 
begin 

A  :  =  A  +  c  -u  -  v 

J  lk  1  _k 

if  A  <  A  then  A  :=  A,  p 
k  k  Me 

end 

end 

until  j  is  unassigned 

end 

Phase  2  *  *  • 

update  dual  variables  •  *  * 
for  each  keLC\{j> 
begin 

i  :=  row  assigned  to  column  k 

v  :  =  v  +  A  -  A 
k  k  k  J 

U  :  *  U  -  A  +  A  . 

1  1  k  J 


=  J 


l 


end 

:  *  u 

l 


+  A 


J 


1  1 

update  current  assignment  *  *  * 
while  p  *0  do 


J 

begin 


i  :=  row  assigned  to  column  p 
A  :=  A  v{  (i,  j)}\{.(i,p  )} 

J  :  =  P, 


J 


end; 


A  :  =  A  (  i  ,  j ) } 


end 


When  the  problems  to  be  solved  are  large,  it  pays  to  use  sparse  matrix 
techniques,  i.e.  to  restrict  the  search  for  a  successor  in  the  augmenting 
paths  to  the  smallest  k  elements  of  each  row  of  the  reduced  cost  matrix  for 
some  properly  chosen  k,  and  check  for  dual  feasibility  of  the  solution  found 
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at  the  end  of  the  procedure.  If  dual  feasibility  is  violated,  the 
corresponding  rows  and  columns  must  be  reassigned;  but  by  proper  choice  of  k, 
the  probability  of  a  need  for  such  reassignments  can  be  kept  rather  low  (see 
[7,8]  for  a  discussion  of  this  procedure). 

3.  Parallelization:  General  Considerations 

The  efficiency  of  parallelization  is  usually  measured  by  the  speedup, 
i.e.  the  ratio  between  the  times  needed  to  solve  a  given  problem  with  a  single 
processor  and  with  p  processors.  The  speedup  in  turn  depends  on  the  time 
spent  by  all  the  processors  on  actual  computing,  versus  the  time  spent  on 
inter-processor  communication  or  idling. 

A  key  factor  that  affects  the  efficiency  of  a  parallelization  is  its 
granularity.  High  granularity  parallelism  is  one  that  allows  each  processor 
to  execute  substantial  amounts  of  computation  before  the  need  for 
communication  arises.  Low  granularity  parallelism  is  one  that  requires 
frequent  communication  and  data  transfer  between  processors.  An  example  of 
the  first  type  of  parallelism  is  a  branch  and  bound  procedure  in  which  every 
processor  works  on  a  different  subproblem;  while  an  example  of  the  second  type 
is  a  sorting  algorithm  in  which  each  processor  compares  two  items  and  passes 
on  the  result.  On  most  existing  parallel  computers  higher  granularity 
parallelization  yields  a  higher  speedup. 

Another  factor  that  affects  the  efficiency  of  parallelization  is  the 
frequency  of  synchronization  points  in  the  procedure.  From  time  to  time  the 
processors  that  have  finished  a  certain  task  have  to  wait  until  all  others 
finish  the  same  task,  in  order  to  exchange  some  information  needed  for 
continuation.  This  may  create  a  substantial  amount  of  idle  time,  so  the  less 
frequent  the  synchronization,  the  more  efficient  the  procedure  will  tend  to 
be. 
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Last,  but  not  least,  the  efficiency  of  parallelization  depends,  for  a 
given  algorithm,  on  the  architecture  of  the  computer  used  Parallel  computers 
may  be  classified  into  Single  Instruction  Multiple  Data  (SIMD)  and  Multiple 
Instructions  Multiple  Data  (MIMD).  Within  the  latter  category,  one  may 
distinguish  between  Uniform  Memory  Access  (UMA),  Non-Uniform  Memory  Access 
(NUMA),  and  No  Remote  Memory  Access  (NORMA)  computers.  In  UMA  machines  all 
memory  access  is  uniform,  i.e.  access  times  are  independent  of  memory 
location.  NUMA  machines  distribute  memory  across  processors,  thus  some  access 
(local)  takes  less  time  than  other  (remote).  The  difference  may  be  small  or 
large,  depending  on  the  kind  of  connection  used  for  remote  memory  access. 
NORMA  machines  allow  no  remote  memory  access,  all  interprocessor  communication 
takes  place  via  explicit  message  passing.  Our  implementation  is  designed 
primarily  for  a  NUMA  architecture  where  the  ratio  between  remote  and  local 
access  times  is  on  the  order  of  ten  (the  ratio  for  our  computer).  Larger 
access  time  ratios  may  require  modifying  some  of  the  implementation  details. 
The  algorithm  should  port  as  described  to  most  UMA  architectures.  The  only 
architectural  requirements  are  interprocessor  communication  via  shared  memory, 
atomic  operations  on  selected  memory  locations,  and  processor  synchronization. 
Implementation  on  NORMA  architectures  should  also  be  possible,  but  the  data 
placement  strategy  must  be  modified  according  to  the  capabilities  of  the 
machine. 

For  a  discussion  of  concepts  and  issues  related  to  parallel  computing  see 
[17,26,25,13].  In  our  approach,,  the  cost  matrix  is  evenly  distributed  across 
processor  memories  in  contiguous  blocks  of  rows.  This  partitioning  scheme 
makes  it  possible  to  store  quite  large  matrices.  In  view  of  the  non-uniform 
memory  access  times,  calculations  are  designed  so  that  each  processor  works 
primarily  with  the  rows  stored  in  local  memory.  Each  processor’s  local  memory 
holds  a  row  buffer.  Whenever  a  processor  needs  access  to  a  row  stored  in 
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another  processor’s  memory,  it  copies  the  row  in  question  into  its  row  buffer, 
which  it  then  uses  to  access  locally  individual  row  elements  one  at  a  tinr*,  as 
needed. 

The  current  assignment  A  is  stored  as  a  single  predecessor  /successor 

1  ist. 


The  dual  variables  u^  and  v^  are  stored  as  follows.  Every  processor 

keeps  a  copy  of  each  column  variable  v  ,  but  there  is  only  one  centrally 

stored  copy  of  the  row  variables  u  .  The  reason  for  this  is  that  the  reduced 

costs  c  =  c  -  u  -  v  are  calculated  row-wise,  i.e.  u  remains  unchanged 
U  lj  i  J  i 

for  an  entire  row  while  v^  changes  with  every  c  .  This  way  there  are  no 
simultaneous  memory  access  requests  for  the  frequently  used  column  variables, 
and  only  a  small  chance  of  simultaneous  access  requests  for  the  much  less 
frequently  needed  row  variables.  Besides  the  aspect  of  minimizing  conflicting 
memory  access  requests,  this  storage  scheme  also  capitalizes  on  the  fact  that 
local  memory  access  is  cheaper  than  remote  access. 

The  specifics  of  the  parallelization  in  each  phase  are  discussed  below. 


4.  The  Initial  Solution  Phase 

The  calculation  of  the  initial  feasible  solution  to  the  dual  of  (AP)  is 
done  in  parallel.  First,  each  processor  calculates  the  value  of  as  the 
minimum  of  c^,  jeT,  for  each  of  its  own  rows.  Then  each  processor 

calculates,  for  each  column  jeT,  the  minimum  of  c  -  u  over  its  own  set  of 
rows,  and  one  processor  calculates  v^  as  the  smallest  of  these  partial  minima. 

Finally,  each  processor  calculates  the  reduced  costs  c  -  u  -  v  for  its  own 

U  i  j 

rows,  and  constructs  its  part  of  the  admissible  graph  G  :=  ( SuT , A ) ,  where  A  : = 
{(i,J)eA  :  c  -  u^  -  v^  =  0} 

Next,  a  maximum  matching  is  found  in  the  admissible  graph  by  a 
sequential  algorithm.  The  time  required  for  this  is  too  small  to  justify 
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parallelization,  which  could  only  be  of  relatively  low  granularity. 

Next  the  sparse  matrix  structure  is  created,  again  in  parallel,  by  having 
each  processor  find  the  k  smallest  costs  in  each  of  its  rows.  For  problems 

with  n  £  S00  we  used  k  =  n/50,  and  the  solution  obtained  in  this  way  'was 

always  dual  feasible. 

S  The  Augmenting  Path  Finding  Phase 

At  the  end  of  Phase  0,  q  rows  and  q  columns  have  been  assigned  to  each 
other,  for  some  positive  integer  q  s  n,  and  n-q  rows  and  columns  are 
unassigned.  In  principle  there  are  two  ways  in  which  the  search  for  augmenting 
paths  can  be  parallelized:  (i)  have  each  processor  choose  a  different 

unassigned  node  in  S  and  apply  the  labeling  technique  in  an  attempt  to  find  am 
augmenting  path.  (ii)  have  several  processors  jointly  attempt  to  find  an 

augmenting  path  from  some  unassigned  node  in  S.  In  the  first  case,  several 

augmenting  paths  are  generated  in  parallel,  but  each  path  is  found  by  a  single 
sequential  procedure.  In  the  second  case,  each  path  is  constructed  in 
parallel.  The  second  kind  of  parallelization  is  of  lower  granularity,  so  on 
most  existing  parallel  computers  the  first  kind  ought  to  have  priority.  In 
fact,  a  preliminary  investigation  of  the  second  kind  of  parallelization  showed 
so  little  promise  on  our  computer  that  we  have  so  far  not  implemented  it.  (On 
other  types  of  computers  it  may  be  advantageous  to  implement  both  (i)  and 

(ii)).  The  rest  of  this  discussion  will  concentrate  on  (i). 

Suppose  each  processor  uses  the  labeling  technique  to  construct  an 

alternating  tree  rooted  at  a  different  node,  and  as  a  result  finds  an 

augmenting  path.  Can  all  the  pajihs  found  in  this  way  be  used  simultaneously 

to  augment  the  current  partial  Assignment  while  leaving  it  optimal?  If  the 

\ 

alternating  trees  found  by  labeling  are  pairwise  disjoint,  then  clearly 

performing  the  augmentation  and  ^ie  recalculation  of  the  dual  variables  in 
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parallel  has  the  same  effect  as  performing  them  sequentially,  hence  the 
procedure  is  legitimate.  On  the  other  hard,  if  two  augmenting  paths  have  some 
node  in  common,  then  one  of  them  cannot  be  used  for  augmentation.  Thus  the 
paths  used  for  augmentation  have  to  be  disjoint.  The  crucial  question  is, 
what  happens  if  we  have  a  collection  of  augmenting  paths  that  are  pairwise 
disjoint,  but  the  "corresponding  alternating  trees  are  not?  In  this  case 
performing  the  augmentation  and  the  recalculation  of  the  dual  variables  in 
parallel  may  not  have  the  same  effect  as  doing  them  sequentially,  and  in  the 
sequential  case  it  may  even  seem  doubtful  whether  using  a  first  path  for 
augmentation  leaves  the  remaining  ,aths  "shortest1'  in  terms  of  the  modified 
reduced  costs.  The  next  Theorem  dispels  these  doubts  by  showing  that  if  che 
procedure  for  updating  the  dual  variables  is  duly  modified,  then  any  set  of 
pairwise  disjoint  augmenting  paths  can  be  used  in  parallel,  even  when  their 

associated  alternating  trees  collide. 

•  —  _ 

Let  A  be  a  matching  of  the  nodes  of  S  £  S  to  those  of  T  £  T,  and  let 

2n 

(u,v)eiR  satisfy 


(3) 


u  *  v 


( i ,  j)eA 


1  l  =  C!j  (i.j)cA  . 

Condition  (3)  is  necessary  and  sufficient  for  A  to  be  a  minimum-cost 

perfect  matching  in  the  subgraph  induced  by  SuT. 

For  h  =  1 . m,  let  P  be  an  augmenting  path  with  resDect  to  A  from 

h 

node  i  eS\S  to  node  j  cT\T,  let  LC  be  the  set  of  columns  labeled  in  the 

h  h  h 

process  of  generating  P  ,  and  for  jrLC  ,  let  Ah  be  the  label  assigned  column  j 

in  that  process.  Further,  let  M  : =  { 1 . m> ,  and  let  LC  be  the  set  of  all 

columns  labeled  while  generating  the  m  augmenting  paths  P  ,  hcM,  i.e.  LC  := 

h 

u  LC  . 

h 

hen 

Assume  now  that  the  augmenting  paths  P  ,  heM,  are  pairwise  node  disjoint, 

h 


and  let  A  be  the  matching  in  G  obtained  by  augmenting  A  along  each  of  the 


9 


paths  P  ■,  heM,  i.e. 

A  :  *  A  ®(  u  P  ) , 
heM  h 

where  a  denotes  the  symmetric  difference  (for  sets  X,  Y,  X©Y  =  XuY\Xr>Y).  Then 
clearly  A  is  a  matching  of  the  nodes  of  SuK  i  ,  .  .  .  ,  i  >  to  those  of 

1  31 

—  •  • 

To/(j . j  >.  Furthermore,  A  has  the  following  property. 

1  a 

Theorem.  For  jcT,  let  i(j)  be  the  row  assigned  to  column  j  by  A  ,  and 


def i ne 


v  -  max  (Ah  -  Ah> 

J  h.-JCLC  Jh  J 
h 


if  jcLC 


otherwise 


u  +  max 
l 

h:  JCLC 


{ A  h  -  A*} 


if  i  =  i C J )  for  some  jcLC 


u  =  1  u  +  A 
I  I  ‘  J. 


if  i  =  i  for  some  heM 
h 

otherwise. 


u  +  v 
i  J 


for  (i,j)cA 
for  (i.jlcA 


Proof.  For  i,j)cA,  we  will  denote  by  c  and  c  the  reduced  costs 

ij  ij 

•  —  • 

associated  with  A  and  A  ,  respectively,  i.e.  c  :  =  c  -  u  -  v  ,  c  :  = 

3  U  ij  J  J  U 

*  •  h  h 

c  -  u  -  v  .  Also,  for  each  heM,  we  will  denote  by  u  and  v  the  dual 
U  i  J  i  J 

variables,  and  by  ch  the  reduced  costs  (i.e.  ch  :=  c  -  uh  -  vh)  that  would 

tj  ij  iJ  i  J 

be  obtained  if  A  were  augmented  by  using  only  the  path  P  .  Since  each  P  is 

h  h 

a  short? ~t  augmenting  path,  it  is  -ell  known  that  c^  2  0  for  all  (i,j)cA  and 
all  heM,  and  c  =0  for  all  (i,J)eA  oP  . 

i  J  h 

Our  proof  of  c  2  0  will  consist  in  showing  for  every  (i,j)cA,  either 

•  •  h 

that  c  2  c  ,  or  that  c  2c  for  some  heM. 

1 J  » J  1 J  i  J 

For  this  purpose,  we  first  state  explicitly  the  values  of  uj\  and  v*\ 


Recall  that  P  connects  i  cS\S  to  J  cT\T  and  that  i(J)  is  the  row  assigned  by 

h  h  h 
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A  to  column  j.  For  heM,  we  have 


v  -  Ah  +  Ah 
J  Jh  J 


if  jeLC 


otherwise 


,  h  .  h 

u  +  A  -  A 

1  Jh  J 


u ■  V  u  +  A 
‘  1  i  J. 


if  i  =  i(j)  for  some  jeLC 


if  i  =  i 


[  otherwise 

Note  that,  by  construction,  the  labels  satisfy  Ah  a  Ah  for  all  jcLC  and 

Jh  J  h 


all  heM. 


We  now  examine  c  for  different  positions  of  i  and  j.  First  if  icS\S 


and  i*i  for  all  heM,  then  u  =  u  and 
h  i  1 


c  -  u  -  v  +  max  {Ah  -  Ah> 

1J  1  J  h:  JCLC  Jh  J 

h 


c  -  u  -  V 
1J  1  J 


hence  c  a  c  a  0. 

U  U 

Next,  if  i  =  i  for  some  heM,  then  u  =  uh  and 

h  ti 

r  h  , .  k  .  it, 

c  -  u  -  v  +  max  {A  -  A  > 

lJ  1  J  k;  JCLC  Jk  J 

k 


if  jeLC 


otherwise 


if  jeLC 


h  h 

c  -  u  -  v 
‘J  i  J 


otherwise 


Since  -v  +  maoc  {Ak  -  A1'}  a  -vh  for  all  jeLC,  c  a  ch  follows. 

J  k : JCLC  Jk  J  J  lJ  1J 

k 

Now  let  ieS,  namely  let  i  =  i(k)  for  some  keT.  There  are  several  cases 
to  be  considered. 

Case  1.  keLC,  jeLC.  Then 

c  =  c  -  u  -  max  {Ah  -  Ah>  -  v  +  max  {Ah  -  Ah> . 

1 J  1  j  l  j  k  j  j  j 

h : keLC  h  h\  JCLC  h 

h  h 

If  j  =  k,  then  c(j  =  c  a  0.  If  j*k,  let  l  and  m  be  the  indices  for  which 
the  two  maxima  in  the  above  expression  are  attained.  Then 
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c  =  c  -  u  -  A  +A-V+A  -A 
IJ  1J  i  J.  k  J  J  J 


'  cl 
ij 


+  (A“  -  Am)  -  (A*  -  \l) 


J 


c  +  (A  -  A  ) 
iJ  J  J 

a 

•  -I 

and  hence  from  the  definition  of  m,  c  a  c  . 

»J  U 

Case  2.  keLC,  jcTNLC.  Then 

c  =  c  -  u  -  max  {A  -  A  }  -  v  a  c  , 
ij  U  i  J  k  j  ij 

h : k£LC  h 
h 

where  l  is  chosen  as  in  Case  1. 

Case  3.  keT\LC,  jeLC.  Then 


c  =  c  -  u  -  v 
ij  ij  i  J 


max  {A  -  A  >  a  c 
J  J  i  j 

h:j£LC  h 
h 


Case  4.  keTXLC,  jcTNLC.  Then 


if  jcLC 


i 


if  jeLCXLC 


l 


c  =  c  -  u  -  V  =  c  . 

IJ  IJ  1  J  IJ 

This  completes  the  proof  of  c  a  Q,  (i,j)eA. 

*•  * 

To  prove  that  (i,j)eA  implies  =  0,  it  is  sufficient  to  point  to  the 

fact  that  since  the  paths  P  ,  heM,  are  pairwise  disjoint,  every  (i,j)cA  is 

n 

**  *  -h 

contained  in  at  most  one  path  P  .  Hence  for  every  (i,j)eA  ,  c  =  c  for 

h  i  J  i  J 

the  partici  iar  heM  for  which  P  contains  (i, j),  and  hence  c  =  O.a 

h  J  1  j 


Corollary. 


is  a  minimum-cost  perfect  matching  in  the  subgraph 


induced  by  Su{  i . i  }uTu<  j  ,  .  .  .  ,  j  } . 

1  IB  1  III 

The  parallel  search  for  augmenting  paths  is  implemented  as  follows. 
After  Phase  0,  all  processors  work  simultaneously  either  on  Phase  1  or  on 
Phase  2  of  tho  algorithm,  the  two  alternating  phases  being  separated  by  a 
synchronization  stage.  During  every  Phase  1,  each  processor  chooses  an 
unassigned  node  icS  and  uses  the  labeling  technique  to  grow  an  alternating 
tree  (in  the  same  way  as  in  the  sequential  algorithm),  until  an  unassigned 
node  jeT  is  labeled;  at  which  point  a  potential  augmenting  path  from  i  to  j 
has  been  identified.  If  this  path  is  node  disjoint  from  all  the  undiscarded 
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paths  found  by  any  of  the  processors  during  the  current  phase,  it  is  stored  as 
an  actual  augmenting  path,  along  with  all  the  labels  assigned  in  the  process 
oi  finding  it;  otherwise  it  is  discarded.  In  either  case,  the  processor  in 
question  chooses  another  unassigned  node  in  S  to  search  for  another  augmenting 
path.  The  phase  ends  when  the  number  of  potential  augmenting  paths  generated 
exceeds  a  certain  fraction  a  of  the  total  number  of  unassigned  nodes.  At  that 
point  all  processors  stop  the  search  for  augmenting  paths  (synchronization) 
and  simultaneously  start  working  on  Phase  2.  Tne  value  of  a,  determined 
experimentally,  was  chosen  to  be  1/3.  This  choice  is  intended  to  balance  the 
advantages  of  longer  Phase  1  runs  against  their  disadvantages:  the  longer  a 
Phase  1  run,  the  higher  the  probability  that  the  newly  found  augmenting  paths 
collide  with  earlier  paths  found  during  the  given  Phase  1  and  have  to  be 
discarded;  and  the  shorter  a  Phase  1  run,  the  higher  the  proportion  of  time 
lost  by  each  processor  in  the  last  unfinished  run  (interrupted  by  the  call  for 
synchronization) . 

The  procedure  of  checking  each  newly  found  path  for  collision  with 
earlier  paths  as  soon  as  it  is  found,  is  a  heuristic  which  favors  simplicity 
and  ease  of  implementation  over  the  benefits  that  could  be  gained  from  storing 
all  potential  augmenting  paths  until  the  synchronization  point,  and  then 
selecting  among  them  a  maximum  number  of  pairwise  node  disjoint  ones  to  serve 
as  actual  augmenting  paths. 

During  the  labeling  procedure,  one  has  to  repeatedly  find  the  minimum  of 
a  set  of  labels.  As  the  potential  augmenting  paths  get  longer  and  longer, 
this  step  becomes  more  and  more  expensive.  To  reduce  its  cost,  we  have 
implemented  the  data  structure  known  as  d-heap  [27].  This  saved  considerable 
computational  effort  towards  the  end  of  the  procedure,  when  the  number  of 
unassigned  nodes  is  small  and  the  potential  augmenting  paths  are  very  long; 
but  it  did  not  justify  itself  in  the  earlier  stages  when  the  potential 
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augmenting  paths  are  shorter.  As  a  result,  our  current  implementation 
switches  to  the  use  of  d-heaps  only  when  the  assignment  is  98%  complete. 


6.  The  Updating  Phase 

For  the  updating  of  the  dual  variables,  the  charges  Au  :  =  u^  -  u  and 

Av  :=  v  -  v  (see  (9))  are  stored  centrally,  initialized  at  zero  at  the 
J  J  J 

beginning  of  Phase  1,  and  updated  whenever  a  processor  finds  a  new  value  that 

warrants  a  change.  The  new  values  are  actually  calculated  during  the  labeling 

procedure  of  Phase  1,  so  that  in  fact  Phase  2  consists  simply  in  putting  into 

effect  the  changes  calculated  during  Phase  1. 

To  be  specific,  the  updating  is  implemented  as  follows.  At  the  start  of 

Phase  1,  Au^  and  Av^  are  set  to  0  for  all  ieS,  jeT.  As  the  processor  working 

on  augmenting  path  P  calculates  the  value  \h  -  \h  for  column  j,  if  \h  - 

h  Jv  J  °  J.  J 

ft  u 

>  -Av  ,  then  Av  is  replaced  by  -(Xh  -  \h);  and  if  i(j)  is  the  row  assigned 
J  J  Jh  J 

to  column  j,  Au  is  replaced  by  Ah  -  Ah. 

J  KJ)  *  Jw  J 

ft 

The  actual  changing  of  the  dual  variables  in  Phase  2  then  consists  of 


each  processor  replacing  its  own  set  of  column  variables  v^  with  v  ,  jcT,  and 
of  replacing  the  centrally  stored  row  variables  uj  with  us . 

•  •  * 

As  to  the  changing  of  the  assignment,  the  operation  A  =  A  ®(  u  P  ), 

h 

hex 

where  M  is  the  set  of  actual  augmenting  paths  generated,  is  executed  on  the 
predecessor/successor  list  that  stores  the  current  assignment. 


7.  Computational  Results 

Our  parallel  shortest  augmenting  path  algorithm  was  implemented  in  the 
programming  language  C  on  a  14  processor  BBN  Butterfly  Plus  computer,  with  56 
megabytes  of  shared  memory.  The  Butterfly  Plus  is  a  non-uniform  memory 
multiprocessor  consisting  of  Motorola  68020/68881  processors  accessing  4 
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megabytes  of  local  memory  each,  and  nonlocai  (or  remote)  memory  through  a 
packet  switched  network.  Remote  memory  access  is  channelled  through  a  switch 
and  is  therefore  slower  than  local  memory  access.  The  Butterfly  does  not  allow 
simultaneous  access  to  individual  memory  locations.  When  two  or  more  requests 
are  made  for  reading  a  memory  location  only  one  access  is  serviced.  The  other 
requests  must  be  retried  at  a  later  time.  A  more  complete  description  of  the 
Butterfly  architecture  may  be  found  in  [25]. 

Table  1  contains  a  summary  of  the  computational  results  obtained  by 
solving  fully  dense  assignment  problems  on  graphs  ranging  in  size  from  1000  to 
30,000  nodes.  This  is  to  our  knowledge  the  first  time  that  problems  of  this 
size  were  solved.  The  costs  for  these  problems  were  drawn  from  a  uniform 
distribution  of  the  integers  in  the  range  [0,100],  [0,1000],  and  [0,10000], 
The  statistics  represent  an  average  of  three  problems  for  each  size  and  cost 
range.  The  column  headings  in  Table  1  have  the  following  meaning: 
n  =  | S |  =  | T | .  The  number  of  initial  assignments  is  the  number  of  assigned 
nodes  at  the  end  of  Phase  0.  Setup  time  is  the  time  required  to  initialize 
the  dual  variables,  construct  the  admissible  graph  G,  and  derive  the  sparse 
cost  matrix  from  the  original  one.  Initial  matching  time  and  augmenting  path 
time  are  the  times  used  for  those  respective  operations.  Optimality  check 
time  is  the  time  required  to  check  that  the  optimal  matching  found  using  the 
sparse  matrix  is  dual  feasible  on  the  complete  cost  matrix.  Finally,  total 
time  is  the  execution  time  of  the  complete  algorithm. 

The  data  of  Table  1  show  that  the  setup  time  and  augmenting  path  time 
together  account  for  roughly  4/5  of  the  total  time,  with  the  augmenting  path 
time  alone  taking  up  between  1/2  and  2/3  of  total  time,  except  for  the 
smallest  cost  range.  While  the  algorithm  that  finds  an  initial  matching  is 
run  sequentially  by  a  single  processor,  it  typically  requires  3-13JS  of  total 
time,  with  the  exception  of  the  cost  range  [0,100],  where  there  is  a  large 
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numoer  of  optimal  solutions  and  the  initial  matching  turns  out  to  be  optimal 
in  every  case.  As  the  data  for  the  [0,1000]  and  [0,10000]  cost  ranges  show, 
the  initial  matching  routine  determines  an  increasing  fraction  of  the  total 
number  of  assignments  as  the  problem  size  becomes  large  relative  to  the  cost 
range.  The  [0,100]  cost  range  results  represent  a  culmination  of  this  effect, 
in  that  the  shortest  augmenting  path  procedure  becomes  in  this  case 
unnecessary  for  determining  an  optimal  solution. 

The  problems  of  size  10000,  20000,  and  30000  shown  in  the  last  segment 
of  Table  1  were  solved  using  a  special  version  of  the  algorithm.  This  version 
generates  the  dense  cost  matrix  a  row  at  a  time  on  each  processor,  determines 
a  sparse  row  from  each  of  these  full  rows,  and  discards  the  full  rows  so  that 
the  complete  cost  matrix  does  not  have  to  be  stored  in  memory  at  any  one 
time.  After  aui  optimal  solution  is  found  on  the  sparse  cost  matrix,  dual 
feasibility  is  verified  on  the  complete  cost  matrix  by  having  processors 
regenerate  and  examine  complete  rows  one  at  a  time.  Thus  the  special  version 
of  the  algorithm  requires  two  complete  cost  matrix  generations  to  guarantee 
optimality  on  the  dense  cost  matrix.  For  the  sake  of  comparability,  the 
execution  times  reported  in  Table  1  do  not  include  matrix  generation  times  but 
do  include  the  time  required  to  determine  the  sparse  cost  matrix  and  check 
dual  feasibility. 
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Table  1.  Algorithm  Performance  with  14  Processors 


Cost  range  [0, 100] 

n 

Number  of 
initial 
assignments 

Setup 
time 
( sec ) 

Initial  Matching 
time  (sec) 

Augmenting 
path  time 
(sec) 

Optimality 
check  time 
(sec ) 

Total 

Time 

(sec) 

1000 

10C0.0 

1.33 

0.68 

- 

- 

2.01 

2000 

2000. 0 

4.05 

1.47 

- 

- 

5.52 

3000 

3000.0 

2.29 

- 

- 

10.66 

Cost  range  [0, 1000] 

n 

_ 

Number  of 
initial 
assignments 

Setup 

time 

(sec) 

Initial  Matching 
time  (sec) 

Augmenting 
path  time 
(sec) 

Optimality 
check  time 
(sec) 

Total 

Time 

(sec) 

IIIIJ 

1.98 

0.30 

6.86 

0.25 

9.39 

2000 

1856.3 

6.49 

1.21 

14.59 

0.90 

23.20 

3000 

2944. 0 

14.51 

5.02 

19.01 

ra 

Cost  range  [0,10000] 

n 

Number  of 
initial 
assignments 

Setup 

time 

(sec) 

Initial  Matching 
time  ( sec ) 

Augment i ng 
path  time 
(sec) 

Optimality 
check  time 
(sec) 

Total 

Time 

(sec) 

1000 

811.7 

1.99 

0.29 

9.  17 

0.25 

11.70 

2000 

1644.0 

6.51 

1.00 

21.69 

0.89 

30.09 

3000 

2472.7 

14.62 

2.  13 

40.33 

2.03 

59.  11 

10000 

8674. 0 

80.95 

22.66 

175.06 

37.  12 

315.79 

2000C 

18569.0 

282.25 

84.99 

717.46 

148.51 

1233.21 

30000 

29467.7 

651.08 

395.24 

1665.61 

324.47 

3036. 40 
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Table  2.  Algorithm  Performance  on  1  Processor 


Cost  range  [0,100] 


n 

Number  of 
initial 
assignments 

Setup 
t  ime 
(sec) 

Initial  Matching 
time  (sec) 

1000 

1000.0 

18.  19 

0.39 

2000 

2000.0 

74.49 

0.74 

3000 

3000.0 

171.07 

1.  11 

(sec) 


(sec)  (sec) 


172. 18 


Cost  range  [0,1000] 


Cost  range  [0,10000] 


n 

Number  of 
initial 
assignments 

Setup 

time 

(sec) 

Initial  Matching 
time  (sec) 

Augmenting 
path  time 
( sec) 

Optimality 
check  time 
(sec) 

Total 

Time 

(sec) 

1000 

811.7 

25.77 

0.28 

13.93 

3.67 

43.65 

2000 

1644.0 

108.28 

0.97 

48.99 

14.26 

172.50 

3000 

2472.7 

258.55 

2.  10 

117.96 

37.64 

416.25 

Table  3.  Algorithm  Speedup 


Cost  Range  (0,100] 

n 

Setup 

Augmenting 

Path 

Optimality 

Check 

Global 

1000 

13.68 

- 

- 

9.24 

2000 

18.39 

- 

- 

13.62 

3000 

20.  44 

- 

- 

16.  15 

Cost  Range  (0,1000] 

n 

Setup 

Augmenting 

Path 

Optimality 

Check 

Global 

1000 

12.98 

1.48 

. 

14.  G8 

4.24 

2000 

16.67 

2.22 

15.89 

6.72 

3000 

17.81 

1.55 

18.43 

8.  11 

Cost  Range  [0,10000] 

n 

Augmenting 

Path 

Optimality 

Check 

Global 

1000 

12.94 

1.51 

14.68 

3.73 

2000 

16.63 

2.26 

16.02 

5.73 

I’"  ' 

17.68 

2.92 

— 

18.  54 

7.04 

Table  2  contains  a  summary  of  the  computational  results  obtained  by 
solving  the  same  fully  dense  assignment  problems  shown  in  Table  1  using  a 
single  processor.  The  performance  data  shown  in  Table  2  were  obtained  by 
distributing  the  complete  cost  matrix  across  all  14  processor  memories  but 
using  only  a  single  processor  to  execute  the  algorithm.  Such  a  partitioning 
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approach  was  necessary  since  the  complete  cost  matrix  could  not  be  stored  in 
any  one  processor  memory. 

Table  3  shows  the  speedup  of  the  Individual  portions  of  the  algorithm  as 
well  as  the  overall  speedup.  Speedup  is  defined  here  to  be  the  sequential 
execution  time  shown  in  Table  2  divided  by  the  parallel  execution  time  shown 
in  Table  1.  As  Table  3  shows,  the  portions  of  the  algorithm  accounted  for  by 
the  setup  times  and  by  the  optimality  check  times  yield  high  speedups.  In 
fact,  for  many  entries  of  Table  3,  the  speedup  actually  exceeds  the  number  of 
processors..  These  superl inear  speedups  represent  a  distortion  mostly  due  to 
the  matrix  partitioning  strategy  described  in  the  previous  paragraph.  A 
processor  executing  the  algorithm  sequentially  must  make  a  large  number  of 
nonlocal  cost  matrix  accesses  during  the  setup  and  optimality  check  phases. 
During  the  multiprocessor  execution,  processors  always  work  out  of  local 
memory  during  the  setup  and  optimality  phases.  Thus  the  sequential  execution 
is  penalized  by  the  nonlocal  memory  accesses. 

Table  3  also  shows  that  the  speedup  for  the  augmenting  path  phase  is 
substantially  less  than  the  number  of  processors.  One  reason  for  this  is  the 
need  for  synchronization.  During  any  Iteration  of  the  algorithm,  all 
processors  must  wait  for  the  slowest  processor  before  moving  to  the  next  phase 
of  the  algorithm,  and  the  length  of  the  interval  between  two  consecutive 
synchronizations  is  limited  by  the  need  to  keep  disjoint  the  potential 
augmenting  paths  constructed.  Furthermore,  towards  the  end  of  the  algorithm, 
when  the  number  of  unassigned  nodes  is  less  than  the  number  of  processors, 
some  of  the  processors  must  lie  idle.  This  phenomenon  is  exacerbated  by  the 
fact  that  the  last  few  augmenting  paths  always  take  many  times  longer  to  find 
than  the  earlier  ones. 

Despite  the  lower  speedup  for  the  augmenting  path  phase,  the  overall 
algorithm  speedup  remains  quite  good.  In  fact,  Table  3  indicates  that  both 
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the  augmenting  path  speedup  and  overall  speedup  increase  with  problem  size  for 
all  three  cost  ranges. 

Other  parallel  algorithms  for  the  assignment  problem  have  been  proposed 
and/or  implemented  by  Bertsekas  [5]  and  Hatay  [14],  based  on  the  auction 
method,  and  by  Kennington  and  Wang  [16],  based  on  the  shortest  augmenting  path 
method.  Bertsekas’  implementation  is  a  simulated  parallel  algorithm  run  on  a 
sequential  VAX  11/750  on  sparse  assignment  problems  ranging  in  size  from  500 
to  2500  nodes.  Speedups  of  about  10  were  obtained  for  a  number  of  virtual 
processors  equal  to  problem  size.  Hatay  has  implemented  the  auction  algorithm 
on  a  20-processor  Sequent  Balance  21000.  For  fully  dense  problems  of  size 
1600  auid  cost  range  [0,10000]  speedups  are  reported  to  be  about  4  and  7  for  5 
and  10  processors,  respectively,  with  a  decrease  for  a  larger  number  of 
processors.  On  the  20  processor  computer,  the  fastest  reported  solution  time 
(48  seconds)  was  obtained  when  using  only  10  processors. 

Kennington  and  Wang’s  implementation  of  the  shortest  augmenting  path 
algorithm  for  an  8-processor  Sequent  Symmetry  S81  is  based  on  several 
processors  simultaneously  constructing  an  augmenting  path.  Tested  on  problems 
of  size  800  to  1200,  the  algorithm  obtains  speedups  between  2.73  and  7.64  for 
the  cost  range  [0,100],  between  2.24  and  5.63  for  the  cost  range  [0,1000],  and 
between  2.46  and  4.89  for  the  cost  range  [0,10000].  Kennington  and  Wang  ran 
extensive  comparisons  of  the  shortest  augmenting  path  algorithm  with  the 
auction  algorithm,  and  concluded  that  for  dense  assignment  problems  the  former 
dominates  the  latter. 

Finally,  a  parallel  version  of  the  primal  simplex  method  for  the 
transportation  problem  was  implemented  by  Miller,  Pekny  and  Thompson  [23]  and 
tested  on  the  same  14-processor  Butterfly  Plus  computer  on  which  our  algorithm 
was  run.  This  code  of  course  is  meant  for  a  more  general  problem,  but  it  can 
be  used  to  solve  assignment  problems.  For  assignment  problems  of  size  3000 
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with  costs  in  the  ranges  [0,1000!  and  [0,10000],  the  speedups  are  about  7,  but 
the  absolute  times  are  about  500  and  900  seconds,  respectively,  for  the  two 
ranges . 

8.  Application  to  the  Asymmetric  Traveling  Salesman  Problem 

One  well  known  application  of  the  assignment  problem  is  its  use  in 
branch  and  bound  algorithms  to  solve  the  traveling  salesman  problem  on  a 
directed  graph,  also  called  the  asymmetric  traveling  salesman  problem  (ATSP) 
(see  [1]  for  a  survey).  The  assignment  problem  (AP)  obtained  from  the 
standard  integer  programming  formulation  of  the  ATSP  by  removing  the  subtour 
elimination  constraints  is  a  relaxation  of  the  ATSP  whose  strength  is  best 
illustrated  by  the  fact  that  for  randomly  generated  costs  the  value  of  an 
optimal  assignment  is  within  1%  of  the  value  of  an  optimal  tour  for  problems 
with  100  nodes,  and  this  percentage  decreases  with  problem  size.  As  a 
result, applying  branch  and  bound  to  the  ATSP  with  random  costs  typically 
results  in  search  trees  of  manageable  size.  Miller  and  Pekny  [22,24]  have 
implemented  a  parallel  branch  and  bound  algorithm  for  the  ATSP,  using  the  AP 
as  a  relaxation.  Table  4  shows  the  effect  of  solving  the  AP  at  the  root  node 
of  the  search  tree  with  the  parallel  algorithm  described  in  this  paper,  as 
opposed  to  the  corresponding  sequential  algorithm.  The  data  of  the  table 
represent  average  times  for  three  problems  in  each  class,  except  for  the 
largest  size  (n  -  7500),  for  which  only  two  problems  were  run.  All  problems 
up  to  n  *  3000  were  run  on  the  14-processor  Butterfly  Plus,  while  the  problems 
for  n  =  5000  and  n  =  7500  were  run  on  a  100-processor  Butterfly  Plus. 
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Table  4.  Performance  of  Parallel  Branch  and  Bound  Algorithm 
for  the  Asymmetric  Traveling  Salesman  Problem  on  14  Processors 


Cost  Range  [0,1000] 

n 

Execution  time  (sec) 
with  parallel 
assignment  algorithm 

Execution  time  (sec) 
with  sequential 
assignment  algorithm 

1000 

71. 33 

101. 81 

1500 

137.62 

213.21 

2000 

272. 98 

405. 81 

2500 

585.09 

784. 45 

3000 

451. 71 

740.57 

Cost  Range  [0, 10000] 

n 

Execution  time  (sec) 
with  parallel 
assignment  algorithm 

Execution  time  (sec) 
with  sequential 
assignment  algorithm 

1000 

249.09 

326.04 

1500 

302.31 

379.97 

2000 

1017.05 

1159.46 

2500 

777.  19 

1007.24 

J 

3000 

1392.  12 

1749.26 

5000* 

1606. 30 

— 

• 

7500 

8809. 40 

— 

100-processor  Butterfly  Plus 
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